US007793147B2 


(12) United States Patent 

Stange et al. 


(10) Patent No.: US 7,793,147 B2 

(45) Date of Patent: Sep. 7, 2010 


(54) METHODS AND SYSTEMS FOR PROVIDING 
RECONFIGURABLE AND RECOVERABLE 
COMPUTING RESOURCES 

(75) Inventors: Kent Stange, Phoenix, AZ (US); 

Richard Hess, Glendale, AZ (US); 
Gerald B Kelley, Glendale, AZ (US); 
Randy Rogers, Phoenix, AZ (US) 

(73) Assignee: Honeywell International Inc., 

Morristown, NJ (US) 

( * ) Notice: Subject to any disclaimer, the term of this 

patent is extended or adjusted under 35 
U.S.C. 154(b) by 883 days. 

(21) Appl. No.: 11/458,301 

(22) Filed: Jul. 18, 2006 

(65) Prior Publication Data 

US 2008/0022151 A1 Jan. 24, 2008 

(51) Int.Cl. 

G06F 11/00 (2006.01) 

(52) U.S. Cl. 714/13; 714/12; 714/16 

(58) Field of Classification Search 714/4, 

714/11, 12, 13, 15, 16 

See application file for complete search history. 

(56) References Cited 

U.S. PATENT DOCUMENTS 


4,345,327 A 8/1982 Thuy 

4,453,215 A 6/1984 Reid 

4,751,670 A 6/1988 Hess 

4,996,687 A 2/1991 Hess et al. 

5,086,429 A 2/1992 Gray et al. 

5,313,625 A 5/1994 Hess et al. 

5,550,736 A 8/1996 Hay et al. 

5,732,074 A 3/1998 Spaur et al. 

5,757,641 A 5/1998 Minto 

5,903,717 A 5/1999 Waidrop 


5,909,541 A 6/1999 Sampson et al. 

5,915,082 A 6/1999 Marshall et al. 

(Continued) 

FOREIGN PATENT DOCUMENTS 
EP 0363863 4/1990 

(Continued) 

OTHER PUBLICATIONS 

Lee, “Design and Evaluation of a Fault-Tolerant Multiprocessor 
Using Hardware Recovery Blocks”, Aug. 1982, pp. 1-19, Publisher: 
University of Michigan Computing Research Laboratory, Published 
in: Ann Arbor, MI. 

(Continued) 

Primary Examiner — Joshua A Lohn 

(74) Attorney, Agent, or Firm — Fogg & Powers LLC 

(57) ABSTRACT 

A method for optimizing the use of digital computing 
resources to achieve reliability and availability of the com- 
puting resources is disclosed. The method comprises provid- 
ing one or more processors with a recovery mechanism, the 
one or more processors executing one or more applications. A 
determination is made whether the one or more processors 
needs to be reconfigured. A rapid recovery is employed to 
reconfigure the one or more processors when needed. A com- 
puting system that provides reconfigurable and recoverable 
computing resources is also disclosed. The system comprises 
one or more processors with a recovery mechanism, with the 
one or more processors configured to execute a first applica- 
tion, and an additional processor configured to execute a 
second application different than the first application. The 
additional processor is reconfigurable with rapid recovery 
such that the additional processor can execute the first 
application when one of the one or more processors fails. 

20 Claims, 5 Drawing Sheets 
METHODS AND SYSTEMS FOR PROVIDING 
RECONFIGURABLE AND RECOVERABLE 
COMPUTING RESOURCES 

The U.S. Government may have certain rights in the 
present invention as provided for by the terms of Contract No. 
NCC-1-393 with NASA. 

BACKGROUND TECHNOLOGY 

Computers have been used in digital control systems in a 
variety of applications, such as in industrial, aerospace, medi- 
cal, scientific research, and other fields. In such control sys- 
tems, it is important to maintain the integrity of the data 
produced by a computer. In conventional control systems, a 
computing unit for a plant is typically designed such that the 
resulting closed loop system exhibits stability, low-frequency 
command tracking, low-frequency disturbance rejection, and 
high-frequency noise attenuation. The “plant” can be any 
object, process, or other parameter capable of being con- 
trolled, such as aircraft, spacecraft, medical equipment, elec- 
trical power generation, industrial automation, a valve, a 
boiler, an actuator, or other controllable device. 

It is well recognized that computing system components 
may fail during the course of operation from various types of 
failures or faults encountered during use of a control system. 
For example, a “hard fault” is a fault condition typically 
caused by a permanent failure of the analog or digital cir- 
cuitry. For digital circuitry, a “soft fault” is typically caused 
by transient phenomena that may affect some digital circuit 
computing elements resulting in computation disruption, but 
does not permanently damage or alter the subsequent opera- 
tion of the circuitry. For example, soft faults may be caused by 
electromagnetic fields created by high-frequency signals 
propagating through the computing system. Soft faults may 
also result from spurious intense electromagnetic signals, 
such as those caused by lightning that induce electrical tran- 
sients on system lines and data buses which propagate to 
internal digital circuitry setting latches into erroneous states. 

Unless the computing system is equipped with redundant 
components, one component failure normally means that the 
system will malfunction or cease all operation. A malfunction 
may cause an error in the system output. Fault tolerant com- 
puting systems are designed to incorporate redundant com- 
ponents such that a failure of one component does not affect 
the system output. This is sometimes called “masking.” 

In conventional control systems, various forms of redun- 
dancy have been used in an attempt to reduce the effects of 
faults in critical systems. Multiple processing units, for 
example, may be used within a computing system. In a system 
with three processing units, for example, if one processor is 
determined to be experiencing a fault, that processor may be 
isolated and/or shut down. The fault may be corrected by 
correct data, such as the current values of various control state 
variables, being transmitted (or “transfused”) from the 
remaining processors to the isolated unit. If the faults in the 
isolated unit are corrected, the processing unit may be re- 
introduced to the computing system. 

Functional reliability is often achieved by implementing 
redundancy in the system architecture whereby the level of 
redundancy is preserved without effects on the function being 
provided. Availability can be achieved by allocating extra 
hardware resources to maintain functional operation in the 
presence of faulted elements. There is a need, however, to 
minimize the hardware resources necessary to support reli- 
ability requirements and availability requirements in control 
systems. 
BRIEF DESCRIPTION OF THE DRAWINGS 

Features of the present invention will become apparent to 
those skilled in the art from the following description with 
reference to the drawings. Understanding that the drawings 
depict only typical embodiments of the invention and are not 
therefore to be considered limiting in scope, the invention will 
be described with additional specificity and detail through the 
use of the accompanying drawings, in which: 

FIG. 1 is a block diagram of one embodiment of a 
reconfigurable and recoverable computing system; 

FIG. 2 is a block diagram of another embodiment of a 
reconfigurable and recoverable computing system; 

FIG. 3 is a block diagram of a further embodiment of a 
reconfigurable and recoverable computing system; 

FIG. 4 is a processing flow diagram for a method for 
optimizing the use of digital computing resources to achieve 
reliability and availability; and 

FIG. 5 is a block diagram illustrating a fault recovery 
system according to one embodiment. 

DETAILED DESCRIPTION 

The present invention relates to methods and systems for 
providing one or more computing resources that are 
reconfigurable and recoverable wherever digital computing is 
applied, such as in a digital control system. The methods of 
the invention also provide for optimizing the use of digital 
computing resources to achieve reliability and availability of 
the computing resources. Such a method comprises providing 
one or more processors with a recovery mechanism, with the 
one or more processors executing one or more applications. A 
determination is made whether the one or more processors 
need to be reconfigured. A rapid recovery is employed to 
reconfigure the one or more processors when needed. State 
data is continuously updated in the recovery mechanism, and 
the state data is used to transfuse the one or more processors 
for reconfiguration. This method provides for real-time 
reconfiguration transitions, and allows for a minimal set of 
hardware to achieve reliability and availability. 

In general, reconfiguration is an action taken due to 
non-recoverable events (e.g., hard faults or hard failure) or use 
requirements (e.g., flight mission phase). A recovery action is 
generally taken due to a soft fault. The invention provides for 
application of a recovery action during a reconfiguration 
action. This combination of actions lessens the 
reconfiguration time and optimizes computing resource 
utilization. This combination of actions facilitates a more 
rapid reconfiguration of a computational element because 
current state data is maintained within a rapid recovery 
mechanism of a computing unit. The reconfiguration state data 
is pre-initialized with the state data maintained in a computing 
resource with rapid recovery capability, which allows a 
reconfigured computing resource to be brought on line much 
faster than if the state data were not available. The 
reconfiguration is rapid enough so that input/output staleness 
is not an issue. 

Typically, the reconfiguration starts from or ends in a 
redundant/critical system. The hardware can be 
reconfigurable or can have a superset of functions. The 
invention enables a reduction in hardware that is employed to 
achieve reliability and availability for functions being 
provided by a digital computing system so that only a minimal 
set of hardware is required. The invention also enables the 
design of electronic system architectures that can better 
optimize the utilization of computing resources. 

The rapid recovery mechanism may also be used to minimize 
the set of computing resources required to support varying 
computing resources throughout a specified use such as a 
mission. In phases where maximum reliability is required, 
computing resources may be reconfigured to perform 
redundant functionality. The reconfiguration occurs with 
minimal time lag since the state data is maintained in the rapid 
recovery mechanism. In other phases of a mission where 
additional functionality is required to be available, the system 
may be reconfigured to provide the additional computing 
resources and may revert to the high integrity configuration at 
any time since the state data is maintained in the rapid recovery 
mechanism. A typical system without a rapid recovery 
mechanism would require additional hardware to provide 
functionality that is only required during parts of a mission 
and would not be immediately reconfigurable to a higher 
reliability architecture by reutilizing hardware resources. 

Further details with respect to the rapid recovery mechanism 
can be found in copending U.S. application Ser. No. 
11/058,764, filed on Feb. 16, 2005, and entitled “FAULT 
RECOVERY FOR REAL-TIME, MULTI-TASKING 
COMPUTER SYSTEM,” the disclosure of which is 
incorporated herein by reference. 

In the following description, various embodiments of the 
present invention may be described in terms of various 
computer architecture elements and processing steps. It should 
be appreciated that such elements may be realized by any 
number of hardware or structural components configured to 
perform specified operations. For purposes of illustration only, 
exemplary embodiments of the present invention are 
sometimes described herein in connection with aircraft 
avionics. The invention is not so limited, however, and the 
systems and methods described herein may be used in any 
control environment. Further, it should be noted that although 
various components may be coupled or connected to other 
components within exemplary system architectures, such 
connections and couplings can be realized by direct 
connection between components, or by connection through 
other components and devices located therebetween. The 
following detailed description is, therefore, not to be taken in 
a limiting sense. 

Instructions for carrying out the various process tasks, 
calculations, control functions, and the generation of signals 
and other data used in the operation of the systems and 
methods of the invention can be implemented in software, 
firmware, or other computer readable instructions. These 
instructions are typically stored on any appropriate computer 
readable medium used for storage of computer readable 
instructions or data structures. Such computer readable media 
can be any available media that can be accessed by a general 
purpose or special purpose computer or processor, or any 
programmable logic device. 

Suitable computer readable media may comprise, for 
example, non-volatile memory devices including 
semiconductor memory devices such as EPROM, EEPROM, 
or flash memory devices; magnetic disks such as internal hard 
disks or removable disks (e.g., floppy disks); magneto-optical 
disks; CDs, DVDs, or other optical storage disks; nonvolatile 
ROM, RAM, and other like media. Any of the foregoing may 
be supplemented by, or incorporated in, specially-designed 
application-specific integrated circuits (ASICs). When 
information is transferred or provided over a network or 
another communications connection (either hardwired, 
wireless, or a combination of hardwired or wireless) to a 
computer, the computer properly views the connection as a 
computer readable medium. Thus, any such connection is 
properly termed a computer readable medium. Combinations 
of the above are also included within the scope of computer 
readable media. 

An exemplary electronic system architecture in which the 
present invention can be used includes one or more 
processors, each of which can be configured for rapid recovery 
from various faults. The term “rapid recovery” indicates that 
recovery may occur in a very short amount of time, such as 
within about 1 to 2 computing frames. As used herein, a 
“computing frame” is the time needed for a particular 
processor to perform a repetitive task of a computation, e.g., 
the tasks that need to be calculated continuously to maintain 
the operation of a controlled plant. In embodiments where 
faults are detected within a single computing frame, each 
processor need only store control and logic state variable data 
for the immediately preceding computing frame for use in 
recovery purposes, which may take place essentially 
instantaneously so that it is transparent to the user. 
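
For illustration only, the single-frame recovery described above can be sketched as follows. This is a minimal model and not the disclosed hardware: the class name RecoverableTask, the fault_detected flag, and the integrator application are hypothetical, and it assumes a fault is always detected within the frame in which it occurs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

State = Dict[str, float]


@dataclass
class RecoverableTask:
    """Hypothetical processor model that keeps only the state of the
    immediately preceding computing frame for recovery."""
    step: Callable[[State], State]   # work done in one computing frame
    state: State
    last_good: State = field(default_factory=dict)
    recoveries: int = 0

    def __post_init__(self) -> None:
        self.last_good = dict(self.state)

    def run_frame(self, fault_detected: bool = False) -> State:
        if fault_detected:
            # Discard the faulted frame and restore the state saved at
            # the end of the previous frame.
            self.state = dict(self.last_good)
            self.recoveries += 1
        else:
            self.state = self.step(self.state)
            self.last_good = dict(self.state)  # checkpoint for next frame
        return self.state


def integrator(s: State) -> State:
    # Stand-in control computation: advance one state variable per frame.
    return {"x": s["x"] + 1.0}


task = RecoverableTask(step=integrator, state={"x": 0.0})
task.run_frame()
task.run_frame()
task.run_frame(fault_detected=True)  # upset: roll back to the frame-2 state
task.run_frame()                     # normal operation resumes
```

Because only one prior frame is retained, the storage cost is a single copy of the state variables, at the price of losing one frame of computation per recovery.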

The invention provides for use of common computing 
resources that can be both reconfigurable and rapidly 
recoverable. For example, a common computing module can 
be provided that is both reconfigurable and rapidly recoverable 
to provide aerospace vehicle functions. Typically, aerospace 
vehicle functions can have failure effects ranging from 
catastrophic to no effect on mission success or safety. In 
control functions requiring rapid real time recovery (e.g., 
aircraft inner loop stability), the computing module capability 
provides recovery that is rapid enough such that there would 
be no effect perceived at the function level. Thus, the recovery 
is transparent to the function. 

In general, a computing system according to embodiments 
of the invention provides reconfigurable and recoverable 
computing resources. Such a system comprises one or more 
processors with a recovery mechanism, the processors 
configured to execute a first application, and a first additional 
processor configured to execute a second application different 
than the first application. The additional processor is 
reconfigurable with rapid recovery such that the additional 
processor can execute the first application when one of the 
one or more processors fails. In another embodiment, the 
system further comprises a second additional processor 
configured to execute a third application different from the 
first and second applications. The second additional processor 
is reconfigurable such that it can execute the second 
application if the first additional processor fails. 

In the following description of various exemplary 
embodiments of the invention, a particular number of 
processors are described for each of the computing systems. It 
should be understood, however, that other embodiments can 
perform the same functions as described with more or fewer 
processors. Thus, the following embodiments are not to be 
taken as limiting. In addition, some processors are associated 
with optional recovery mechanisms, since these processors do 
not always need to store state data to perform their functions 
when reconfigured. 

FIG. 1 depicts a system in which reconfiguration utilizing 
rapid recovery technology is provided for maintaining 
reliability of a control system. As shown, a fault tolerant 
computing system in a first configuration 100a has a set of 
three computing resources 110, 120, and 130 that are 
configured to execute an application A by respective 
processors 1, 2, and 3. Recovery mechanisms 112, 122, and 
132 are also respectively provided in computing resources 
110, 120, and 130. Each computing resource 110, 120, and 
130 provides an independent output that is operatively 
connected to a decision logic module 150. The decision logic 
module 150 implements an algorithm that maintains an 
appropriate action 160 in the event that one of the processors 
sends an erroneous output to decision logic module 150. 
The minimum number of processors required to implement 
this scheme is three because only then is it possible to tell 
which processor is in error by comparison to the outputs of the 
other processors. Assuming that all three processors are oper- 
ating correctly from the start and that only one fails at a time, 
then it is possible for the decision logic to continue to provide 
an error-free action even after a single processor has failed. 
The problem is that a second failure would make it impossible 
for the decision logic to continue to provide the appropriate 
action because it is not possible with only two inputs to tell 
which processor has failed. One solution is to have four or 
more processors executing the same application so it is pos- 
sible to continue correct operation after the second failure. 
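
The voting argument above can be illustrated with a short sketch. The patent does not specify the algorithm used by the decision logic module, so this assumes a plain strict-majority vote over exactly comparable outputs; the function name vote is hypothetical, and a real system would likely compare within tolerances.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Return (agreed value, indices of dissenting channels).

    The agreed value is None when no strict majority exists, in which
    case every channel is reported as suspect.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]


three_channels = vote([7, 7, 9])     # one bad channel can be identified
two_channels = vote([7, 9])          # a disagreement cannot be attributed
four_channels = vote([7, 9, 7, 7])   # a fourth channel restores the margin
```

With two channels a mismatch is detectable but the faulty channel cannot be isolated, which is why the redundancy level must be restored before a second failure occurs.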

As depicted in FIG. 1, a fourth computing resource 140 is 
provided with an optional recovery mechanism 142. The pro- 
cessor 4 of computing resource 140 is not initially executing 
the same application A as computing resources 110, 120, and 
130. Instead, processor 4 is executing application B. The 
processor 4 does not need to execute application A because 
three outputs are sufficient for decision logic module 150 to 
decide which processor has failed for the first failure. The first 
configuration 100a is reconfigured (170), after one of 
processors 1-3 has failed, into a second configuration 100b. 
Processor 4 is used to execute application A and provide the third 
output to decision logic module 150 that was formerly being 
provided by the now failed processor. For example, if proces- 
sor 3 fails it is stopped from affecting the control output being 
sent from decision logic module 150 and is replaced by pro- 
cessor 4, which is reconfigured with recovery data from pro- 
cessor 1/application A to maintain the redundancy level. Uti- 
lizing such reconfiguration and rapid recovery minimizes the 
hardware resources required to support both reliability and 
availability. 

The system architecture of FIG. 1 provides the ability to 
reconfigure a processor and begin executing a different appli- 
cation when needed. To ensure that the system provides the 
required level of reliability as before, the reconfiguration 
must occur in a sufficiently short time that the probability of 
the second processor failure occurring between the time that 
the first failure occurs and the reconfiguration is completed is 
very small. The recovery mechanisms in the computing 
resources store state information relevant to the executing 
application. In the event of one or more computing errors, it is 
possible for a processor to continue executing using the stored 
state information that was previously saved during an earlier 
computation cycle. This same state data is also used to rapidly 
reconfigure the fourth processor to execute a critical applica- 
tion in the event of a non-recoverable error in any of the three 
redundant processors. Without this state data, the amount of 
time required to bring another processor on-line would be 
greatly extended. 
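
As a hedged illustration of the FIG. 1 transition, the sketch below retasks a spare resource with state data copied from a healthy channel. The names Resource and reconfigure_spare and the pitch_cmd state variable are hypothetical, and the model ignores timing, which in practice determines whether the exposure window to a second failure is acceptably small.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Resource:
    name: str
    app: str
    state: Dict[str, float]   # stands in for the recovery mechanism contents
    failed: bool = False


def reconfigure_spare(channels: List[Resource], critical_app: str,
                      spare: Resource) -> Optional[Resource]:
    """If a channel running critical_app has failed, retask the spare with
    state data saved by a healthy channel. Returns the spare, or None when
    no reconfiguration is needed or possible."""
    failed = [r for r in channels if r.app == critical_app and r.failed]
    healthy = [r for r in channels if r.app == critical_app and not r.failed]
    if not failed or not healthy:
        return None
    donor = healthy[0]
    spare.app = critical_app
    spare.state = dict(donor.state)   # transfuse a copy, not a shared reference
    return spare


p1 = Resource("processor 1", "A", {"pitch_cmd": 0.25})
p2 = Resource("processor 2", "A", {"pitch_cmd": 0.25})
p3 = Resource("processor 3", "A", {"pitch_cmd": 0.25})
p4 = Resource("processor 4", "B", {})

p3.failed = True                      # non-recoverable error in processor 3
result = reconfigure_spare([p1, p2, p3], "A", p4)
```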

FIG. 2 illustrates a fault tolerant computing system accord- 
ing to another embodiment that employs a reconfiguration 
method utilizing rapid recovery to minimize the hardware 
computing resources needed to achieve and maintain required 
functional availability. A first configuration 200a of the com- 
puting system has a first computing platform 202 and a sec- 
ond computing platform 204, such as left and right cabinets in 
a flight control computer system. The computing platform 
202 includes a set of computational resources 210 and 220. 
The computing platform 204 includes a set of computational 
resources 230 and 240. Recovery mechanisms 212, 222, and 
232, are respectively provided in computational resources 
210, 220, and 230. The computational resource 240 is pro- 
vided with an optional recovery mechanism 242. 

The computational resources 210 and 220 are configured to 
respectively execute applications A and B by respective 
processors 1 and 2. The computational resources 230 and 240 
are configured to respectively execute applications A and C by 
respective processors 3 and 4. Thus, application A is 
redundantly hosted in processors 1 and 3. Application B is not 
redundant and is hosted only in processor 2. Application C, 
which has the least critical function, is hosted only in 
processor 4. 

As shown in FIG. 2, rapid recovery is used with a consistent 
set of state data to reconfigure (270) the first configuration 
200a, which is hosting a non-essential application, into a 
second configuration 200b. For example, application B state 
data is continuously updated in recovery mechanism 222. If 
an unrecoverable failure is detected in processor 2, processor 
4 is reconfigured with recovery data from processor 2 in order 
to host application B and thus maintain the availability of 
application B. Application C originally running on processor 
4 is not required to meet the minimum system functionality 
and hence is superseded by the more critical application B. 

The availability of fresh and consistent state data provided 
by the rapid recovery technique ensures rapid initialization of 
critical applications. Reconfiguration allows the system to 
meet functional availability requirements without immediate 
removal and replacement of a faulted computational element. 
Without rapid recovery, starting application B on processor 4 
would require a lengthy initialization period to become 
initialized and synchronized with the system. An immediate 
maintenance action would be required to diagnose and 
replace the faulty computational element and then restart the 
system without reconfiguration. 
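
The FIG. 2 behavior, in which a less critical application is superseded, can be sketched as below. The numeric CRITICALITY ranking, the Host record, the rehost function, and the valve_pos variable are assumptions made for illustration; the patent states only that application C is the least critical and that application B state data survives in recovery mechanism 222.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed ranking for illustration: higher number means more critical.
CRITICALITY = {"A": 3, "B": 2, "C": 1}


@dataclass
class Host:
    app: str
    state: Dict[str, float] = field(default_factory=dict)
    failed: bool = False


def rehost(hosts: Dict[int, Host], failed_id: int,
           recovery_store: Dict[str, Dict[str, float]]) -> Optional[int]:
    """Move the application of a failed processor onto the healthy processor
    hosting the least critical application, if that application is less
    critical than the one lost. Returns the chosen processor id or None."""
    hosts[failed_id].failed = True
    lost_app = hosts[failed_id].app
    lower = [pid for pid, h in hosts.items()
             if not h.failed and CRITICALITY[h.app] < CRITICALITY[lost_app]]
    if not lower:
        return None
    target = min(lower, key=lambda pid: CRITICALITY[hosts[pid].app])
    hosts[target].app = lost_app
    # Initialize from the state data kept by the recovery mechanism.
    hosts[target].state = dict(recovery_store[lost_app])
    return target


hosts = {1: Host("A"), 2: Host("B"), 3: Host("A"), 4: Host("C")}
recovery_store = {"B": {"valve_pos": 0.4}}
chosen = rehost(hosts, failed_id=2, recovery_store=recovery_store)
```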

In a further embodiment, a computing system employs 
reconfiguration and rapid recovery to minimize hardware 
resources required to support both reliability and availability. 
The speed of the reconfiguration transition can be essentially 
real-time when rapid recovery is used. The transitions 
between system configurations are used to achieve reliability 
and availability of the computational elements. 

In a first configuration of this computing system, a number 
of independent applications are executed on independent 
computational resources. For example, a computing system 
can include a first processor with a recovery mechanism that 
is configured to execute a first application, and one or more 
additional processors configured to execute one or more 
applications that are different from the first application. The 
first configuration is employed to achieve an availability of 
functions during a particular phase of a use, such as a flight 
mission for example. A first application is executed on one of 
the computational resources, which utilizes rapid recovery to 
create a reliable backup of state data variables. Other 
applications are executed on the additional computational 
resources. 

During the next phase of a use such as a mission, the first 
application needs to support a highly reliable operation. This 
is achieved in the computing system architecture by 
implementing a redundancy of computing resources in a 
second configuration to achieve reliability. For example, the 
one or more additional processors are reconfigurable such that 
they can execute the first application when needed for 
redundancy. The one or more additional processors are 
reconfigured with recovery data from the first 
application/processor. Additionally, the one or more additional 
processors can be further reconfigured to execute the one or 
more applications again that are different from the first 
application when redundancy is no longer required. 

This embodiment is further illustrated in FIG. 3. A 
computing system in a first configuration 300a includes a set 
of three computational resources 310, 320, and 330 that are 
configured to respectively execute different applications A, 
B, and C by respective processors 1, 2, and 3. A recovery 
mechanism 312 is provided in computational resource 310 
for rapid recovery. The computational resources 320 and 330 
can include optional recovery mechanisms 322 and 332, 
respectively. 

If application A needs to support a highly reliable 
operation, computational resources 320 and 330 are 
reconfigured (370) to become redundant channels for 
application A as shown in a second configuration 300b of 
FIG. 3. Each of computational resources 310, 320, and 330 in 
configuration 300b can provide an independent output that is 
fed to a decision logic module 350. The decision logic module 
350 implements an algorithm that maintains an appropriate 
action 360 in the event that one of the processors in 
computational resources 310, 320, or 330 sends an erroneous 
output to decision logic module 350. 

Once the highly reliable operation is no longer needed, the 
computing system can be returned (380) to the first 
configuration 300a. In a cyclic scenario, the computing system 
can be reconfigured between first and second configurations 
300a and 300b as often as needed for a particular use. 

Without rapid recovery, the initial states of the 
reconfigured computational resources 320 and 330 (with 
processors 2 and 3) would not be in-sync with application A 
executing on processor 1. It would typically require some time 
period of operation before the states of the re-configured 
computational resources (processors 2 and 3) would reach the 
same state as the original application A on processor 1. But 
with rapid recovery, the operational state variables of 
application A on processor 1 from a previous computing frame 
can be loaded into the reconfigured processors 2 and 3 just 
prior to their execution of application A. This allows the initial 
states of the reconfigured computational resources to be 
essentially in-sync with the original state of processor 1. 
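
The cyclic transition of FIG. 3 can be modeled in a few lines. This is a simplified sketch under stated assumptions: the class PhasedSystem and its method names are hypothetical, application A is reduced to a counter, and applications B and C are assumed to restart from empty state when the system returns to the first configuration.

```python
from typing import Dict


class PhasedSystem:
    """Hypothetical model of configurations 300a and 300b."""

    def __init__(self) -> None:
        self.baseline: Dict[int, str] = {1: "A", 2: "B", 3: "C"}
        self.apps: Dict[int, str] = dict(self.baseline)
        self.state: Dict[int, Dict[str, float]] = {1: {"x": 0.0}, 2: {}, 3: {}}
        # Stands in for recovery mechanism 312: state of A after the last frame.
        self.saved_a: Dict[str, float] = dict(self.state[1])

    def run_frame(self) -> None:
        for pid, app in self.apps.items():
            if app == "A":
                self.state[pid] = {"x": self.state[pid]["x"] + 1.0}
        self.saved_a = dict(self.state[1])   # updated every computing frame

    def enter_redundant(self) -> None:
        # Transition 370: processors 2 and 3 become channels for A and are
        # loaded with the saved state just before they start executing A.
        for pid in (2, 3):
            self.apps[pid] = "A"
            self.state[pid] = dict(self.saved_a)

    def leave_redundant(self) -> None:
        # Transition 380: return to the first configuration.
        self.apps = dict(self.baseline)
        self.state[2], self.state[3] = {}, {}


system = PhasedSystem()
system.run_frame()
system.run_frame()
system.enter_redundant()
system.run_frame()                    # all three channels now compute A
redundant_states = [dict(system.state[pid]) for pid in (1, 2, 3)]
system.leave_redundant()
```

Without the saved state, processors 2 and 3 would start from their own initial values and the three channels would disagree until they converged.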

FIG. 4 illustrates a method for optimizing the use of digital 
computing resources to achieve reliability and availability. At 
least one computational resource 410 is provided with a 
processor 412 that is configured to execute an application 414. 
A recovery mechanism 416 is provided in computational 
resource 410 for rapid recovery. One or more additional 
computational resources 410(N) can be optionally provided 
with one or more processors 412(N) if desired depending upon 
the use intended for the computational resources. Such 
additional computational resources can be configured to 
execute one or more applications 414(N), which can be the 
same as or different from application 414. The additional 
computational resources can include an optional recovery 
mechanism 416(N) if desired. 

During operation, a determination is made at 420 whether 
reconfiguration is required for computational resource 410 
(and when present, computational resources 410(N)). If not, 
then computational resource(s) 410 (410(N)) continues 
normal operations in executing application(s) 414 (414(N)). If 
reconfiguration is required, then a rapid recovery is initiated 
at 430 using state data stored in recovery mechanism(s) 416 
(416(N)). The reconfiguration of processor(s) 412 (412(N)) is 
complete at 440 after rapid recovery occurs. 
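
One possible reading of the FIG. 4 flow is sketched below. The dictionary layout of each resource, the required_app callback, and the function name reconfiguration_pass are all hypothetical; the comments map each branch to steps 420, 430, and 440.

```python
from typing import Any, Callable, Dict, List

Resource = Dict[str, Any]   # assumed keys: "app", "state", "recovery"


def reconfiguration_pass(resources: List[Resource],
                         required_app: Callable[[Resource], str]) -> List[str]:
    """Run one pass of the decision flow over every resource and report
    what happened to each one."""
    log: List[str] = []
    for res in resources:
        target = required_app(res)        # 420: is reconfiguration required?
        if target == res["app"]:
            log.append("continue")        # no: keep executing normally
            continue
        # 430: rapid recovery using state data from the recovery mechanism
        res["state"] = dict(res["recovery"][target])
        res["app"] = target               # 440: reconfiguration complete
        log.append("reconfigured")
    return log


resources = [
    {"app": "A", "state": {"x": 1.0}, "recovery": {"A": {"x": 1.0}}},
    {"app": "B", "state": {}, "recovery": {"A": {"x": 1.0}}},
]
log = reconfiguration_pass(resources, lambda res: "A")
```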

In one embodiment, a recoverable real time multi-tasking 
computer system is provided. The system comprises a real 
time computing platform, wherein the real time computing 
platform is adapted to execute one or more applications, 
wherein each application is time and space partitioned. The 
system further comprises a fault detection system adapted to 
detect one or more faults affecting the real time computing 
environment, and a fault recovery system. Upon the detection 
of a fault by the fault detection system, the fault recovery 
system is adapted to restore a backup set of state variables. 
In one embodiment, lock-step fault detection allows a system 
to detect upset events almost immediately. Traditional 
lock step processing implies that two or more processors are 
executing the same instructions at the same time. Self-check- 
ing lock-step computing provides the cross feeding of signals 
from one processing lane to the other processing lane and then 
compares them for deviations on every single clock edge. 

FIG. 5 illustrates one embodiment 500 of a self-checking 
lock-step computing lane 510 of one embodiment of the 
present invention. Self-checking lock-step computing lane 
510 comprises at least two sets of duplicate processors (512 
and 514), memories (520 and 522), and fault detection monitors 
(516 and 518). On every single system clock edge, monitors 
516 and 518 both compare the data bus signal and control 
bus signal output of processors 512 and 514 against each 
other. When the output signals fail to correlate, monitors 516 
and 518 identify a fault. This guarantees that if one processor 
deviates (e.g., because it retrieves a wrong address or is pro- 
vided a wrong data bit) one or both of monitors 516 and 518 
will detect the fault on the next clock edge. The fault is thus 
detected in the same computational frame in which it was 
generated. In one embodiment, when either monitor 516 or 
monitor 518 detects a fault, the monitor notifies processors 
512 and 514. In embodiments of the present invention, upon 
notification of a fault, processors 512 and 514 shut off further 
processing of the application which was executing in the 
faulted computational frame and the fault recovery system is 
invoked. 
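
The per-clock-edge comparison can be illustrated in software, with the caveat that the disclosed monitors are hardware and this sketch only mirrors their logic. The BusSample tuple layout and the function name first_divergence are assumptions.

```python
from typing import Iterable, Optional, Tuple

# Assumed sample layout: (data bus value, control bus value) at one clock edge.
BusSample = Tuple[int, int]


def first_divergence(lane_a: Iterable[BusSample],
                     lane_b: Iterable[BusSample]) -> Optional[int]:
    """Compare two lock-step lanes edge by edge and return the index of the
    first clock edge at which their bus signals differ, or None if they
    agree on every compared edge."""
    for edge, (a, b) in enumerate(zip(lane_a, lane_b)):
        if a != b:
            return edge
    return None


lane_512 = [(0x10, 1), (0x11, 1), (0x12, 0)]
lane_514 = [(0x10, 1), (0x13, 1), (0x12, 0)]   # wrong data at edge 1
fault_edge = first_divergence(lane_512, lane_514)
clean_run = first_divergence(lane_512, lane_512)
```

A non-None result corresponds to the point at which the monitors would notify the processors and the fault recovery system would be invoked.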

In operation, in one embodiment, processors 512 and 514 
hold state variables for applications in respective memories 
520 and 522. The memory locations in memories 520 and 522 
used by each application to store state variables as the appli- 
cations are executed in their respective computational frame 
are referred to as “scratchpad memories.” Fault recovery sys- 
tem 530 creates a duplicate copy of the state variables stored 
in memories 520 and 522, creating a repository of recent state 
variable data sets. Fault recovery system 530 stores off the 
state variables in real time, as processors 512 and 514 are 
executing and storing the state variables in memories 520 and 
522. 

In one embodiment, as state variable values are produced 
by processors 512 and 514 and stored in memories 520 and 
522, there is a redundant copy made in duplicate memory 538. 
In one embodiment, duplicate memory 538 is contained in a 
highly isolated location to ensure the robustness of the data 
stored in duplicate memory 538. In one embodiment, dupli- 
cate memory 538 is protected from corruption by one or more 
of a metal enclosure, signal buffers (such as buffers 544 and 
546) and power isolation. 

One skilled in the art will recognize that it is undesirable to 
load duplicate memory 538 with state variable data in situa- 
tions where the system only partially completed a computing 
frame when the fault occurred. This is because duplicate 
memory 538 could end up storing corrupted data for that 
computing frame. Instead, to ensure that a complete valid 
frame of state variable data is in the duplicate memory and 
available for restoration, embodiments of the present inven- 
tion provide intermediate memories. In one embodiment, a 
duplicate of memories 520 and 522 for even computational 
frames is loaded into even frame memory 534. A duplicate of 
memories 520 and 522 for odd computational frames is 
loaded into odd frame memory 536. The even frame memory 
534 and odd frame memory 536 toggle back and forth copy- 
ing data into the duplicate memory 538 to ensure that a 
complete valid backup memory is maintained. Even frame 
memory 534 and odd frame memory 536 will only copy their 
contents to duplicate memory 538 if the intermediate memories
themselves contain a complete valid state variable
backup for a computing frame that successfully completes its
execution.
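
The toggling between the even and odd frame memories can be sketched in software as follows. This is a minimal model under stated assumptions, not the disclosed hardware: the class FrameBackup and its methods are hypothetical names, memories are modeled as dictionaries keyed by address, and the model assumes the frame counter advances after a faulted frame.

```python
class FrameBackup:
    """Hypothetical model of even/odd frame memories and a duplicate memory."""

    def __init__(self):
        self.intermediate = {0: {}, 1: {}}  # 0 = even frame memory, 1 = odd
        self.duplicate = {}                 # last complete, fault-free state
        self.frame = 0

    def write(self, address, value):
        """Mirror a state-variable write into the current frame's memory."""
        self.intermediate[self.frame % 2][address] = value

    def end_frame(self, fault_detected):
        """Commit the finished frame only if no fault was detected."""
        buf = self.intermediate[self.frame % 2]
        if not fault_detected:
            self.duplicate.update(buf)
        buf.clear()       # a partial (faulted) frame is discarded, not committed
        self.frame += 1   # assumption: the counter advances even after a fault


backup = FrameBackup()
backup.write(0x100, 1)
backup.end_frame(fault_detected=False)   # even frame completes
backup.write(0x100, 2)
backup.end_frame(fault_detected=True)    # odd frame is upset
snapshot_after_fault = dict(backup.duplicate)
backup.write(0x100, 3)
backup.end_frame(fault_detected=False)   # next even frame completes
snapshot_after_good = dict(backup.duplicate)
```

Because a frame's data reaches the duplicate memory only at the end of a fault-free frame, the duplicate memory in this model never holds values from a partially completed frame.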

In one embodiment, fault recovery system 530 also 
includes a variable identity array 542, which provides for the 
efficient use of memory storage. In one embodiment, instead 
of creating backup copies of every state variable for every 
application, variable identity array 542 identifies a subset of 
predefined state variables which allows recovery control 
logic 532 to backup only those state variables desired for 
certain applications into duplicate memory 538. In one 
embodiment, only state variables for predefined applications 
are included in the predefined subset of state variables that are 
duplicated into duplicate memory 538. In one embodiment, 
variable identity array 542 contains predefined state variable 
locations on an address-by-address basis. In one embodiment,
variable identity array 542 allows only the desired state
variable data to load into the intermediate memories. 
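
The address-by-address filtering performed by the variable identity array might be sketched as below. The function mirror_write and the example addresses are hypothetical, and a Python frozenset stands in for whatever hardware lookup structure an implementation would actually use.

```python
def mirror_write(identity_array, intermediate, address, value):
    """Copy a write into intermediate memory only if the address is flagged.

    Returns True when the write was mirrored, False when it was ignored.
    """
    if address in identity_array:
        intermediate[address] = value
        return True
    return False


# Assumed example: only two addresses hold state variables worth backing up.
identity_array = frozenset({0x2000, 0x2004})
intermediate = {}

writes = [(0x2000, 7), (0x3000, 9), (0x2004, 5)]
results = [mirror_write(identity_array, intermediate, a, v) for a, v in writes]
```

Here the write to address 0x3000 is not flagged and so consumes no backup storage, which is the memory saving the variable identity array is meant to provide.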

When recovery control logic 532 is notified of a detected 
fault, recovery control logic 532 retrieves the duplicate state 
variables for an upset application from duplicate memory 538 
and restores those state variables into the upset application’s 
scratchpad memory area of memories 520 and 522. In one 
embodiment, once the duplicate state variables are restored 
into memories 520 and 522, recovery control logic 532 notifies
monitors 516 and 518, and processors 512 and 514 resume
execution of the upset application using the restored state 
variables. 
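
A simplified software analogue of this restore step is shown below. The function restore_state is a hypothetical name, the memories are modeled as dictionaries, and the sketch assumes the upset application's scratchpad area can be described as a list of addresses.

```python
def restore_state(duplicate, scratchpads, app_addresses):
    """Copy backed-up values for one application into every scratchpad.

    Only addresses belonging to the upset application are touched.
    Returns the number of state variables restored.
    """
    restored = 0
    for address in app_addresses:
        if address in duplicate:
            for memory in scratchpads:
                memory[address] = duplicate[address]
            restored += 1
    return restored


duplicate = {0x10: 1, 0x14: 2, 0x80: 9}
# Scratchpads after an upset: each lane holds a corrupted value, and address
# 0x80 belongs to a different application that has since moved on.
memory_a = {0x10: 99, 0x14: 2, 0x80: 10}
memory_b = {0x10: 1, 0x14: 55, 0x80: 10}

count = restore_state(duplicate, [memory_a, memory_b], app_addresses=[0x10, 0x14])
```

After the call both scratchpads agree on the upset application's variables, while the other application's value at 0x80 is left untouched.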

In another embodiment of the present invention, monitors 
516 and 518 are adapted to notify the faulted application of 
the occurrence of a fault, instead of notifying recovery control 
logic 532. In operation, in one embodiment, upon detection of 
a fault affecting an application, the monitor notifies proces- 
sors 512 and 514, which shut off processing of the upset 
application. On the upset application’s next processing 
frame, at least one of processors 512 and 514 notifies the
faulted application of the occurrence of the fault. In one 
embodiment, upon notification of the fault, the upset appli- 
cation is adapted to request the recovery of state variables by 
notifying recovery control logic 532. In one embodiment, 
once the duplicate state variables are restored into memories 
520 and 522, recovery control logic 532 notifies monitors 516 
and 518, and processors 512 and 514 resume execution of the 
upset application using the restored state variables. 
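
The ordering of events in this alternative embodiment can be illustrated with a small state machine. The sketch is an assumption-laden model rather than the disclosed design: the class Application, the enum AppState, and the function recovery_control_logic are hypothetical names, and the actual restoration of state variables is reduced to a log entry.

```python
from enum import Enum, auto


class AppState(Enum):
    RUNNING = auto()
    HALTED = auto()
    NOTIFIED = auto()


class Application:
    """Hypothetical application that requests its own recovery."""

    def __init__(self, request_recovery):
        self.state = AppState.RUNNING
        self._request_recovery = request_recovery
        self.log = []

    def on_fault_detected(self):
        # Monitor notifies the processors, which shut off the application.
        self.state = AppState.HALTED
        self.log.append("halted")

    def on_next_frame(self):
        # On its next processing frame the application learns of the fault
        # and asks the recovery control logic to restore its state variables.
        if self.state is AppState.HALTED:
            self.state = AppState.NOTIFIED
            self.log.append("notified")
            self._request_recovery(self)

    def on_restore_complete(self):
        self.state = AppState.RUNNING
        self.log.append("resumed")


def recovery_control_logic(app):
    """Stand-in for the restore step; a real system would copy state here."""
    app.log.append("restore requested")
    app.on_restore_complete()


app = Application(recovery_control_logic)
app.on_fault_detected()
app.on_next_frame()
```

The point of the model is the sequence: the application is halted first, is told of the fault only on its next frame, and resumes only after the restore it requested has completed.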

The present invention may be embodied in other specific 
forms without departing from its essential characteristics. 
The described embodiments and methods are to be consid- 
ered in all respects only as illustrative and not restrictive. The 
scope of the invention is therefore indicated by the appended 
claims rather than by the foregoing description. All changes 
that come within the meaning and range of equivalency of the 
claims are to be embraced within their scope. 

What is claimed is: 

1. A method for optimizing the use of digital computing 
resources to achieve reliability and availability of the digital 
computing resources, the method comprising: 

providing one or more processors with a recovery mecha- 
nism, the one or more processors executing one or more 
applications, the recovery mechanism comprising: 
a duplicate memory; 

an even frame memory, wherein the recovery mechanism
is configured to duplicate state variables computed
by a real time computing platform during even
computational frames into the even frame memory; 
and 

an odd frame memory, wherein the recovery mechanism 
is configured to duplicate state variables computed by 
the real time computing platform during odd computational
frames into the odd frame memory;
wherein the even frame memory and the odd frame 
memory toggle back and forth duplicating state variables
into the duplicate memory for computational
frames in which no fault is detected;

determining whether the one or more processors needs to 
be reconfigured; and 

employing a rapid recovery to reconfigure the one or more
processors when needed;

wherein upon determining the need to reconfigure, the 
recovery mechanism restores a duplicate set of state 
variables into one or more scratchpad memories for the 
one or more processors. 

2. The method of claim 1, wherein state data is continuously
updated in the recovery mechanism.

3. The method of claim 2, wherein the state data is used to 
transfuse the one or more processors for reconfiguration. 

4. The method of claim 1, wherein the method provides
real-time reconfiguration transitions.

5. The method of claim 1, wherein the method provides for 
one less than a predetermined set of processing hardware to 
achieve reliability and availability for functions being pro- 
vided. 

6. An electronic system architecture that is configured to
implement the method of claim 1.

7. A computing system that provides reconfigurable and 
recoverable computing resources, the system comprising: 

a real time computing platform; 

one or more scratchpad memories in the computing platform;

one or more processors with a recovery mechanism in the 
computing platform, the one or more processors configured
to execute a first application, the recovery mechanism
comprising:

a duplicate memory; 

an even frame memory, wherein the recovery mechanism
is configured to duplicate state variables computed
by the real time computing platform during
even computational frames into the even frame
memory; and

an odd frame memory, wherein the recovery mechanism 
is configured to duplicate state variables computed by 
the real time computing platform during odd computational
frames into the odd frame memory;

wherein the even frame memory and the odd frame 
memory toggle back and forth duplicating state vari- 
ables into the duplicate memory for computational 
frames in which no fault is detected; 

a first additional processor configured to execute a second
application different than the first application; 

wherein the additional processor is reconfigurable with 
rapid recovery such that the additional processor can 
execute the first application when one of the one or more
processors fails; and

wherein upon determining a need to reconfigure, the recov- 
ery mechanism restores a duplicate set of state variables 
into the one or more scratchpad memories. 

8. The system of claim 7, wherein the one or more processors
are in operative communication with a decision logic
module prior to any failure.

9. The system of claim 8, wherein the decision logic mod- 
ule implements an algorithm that maintains an appropriate 
action in the event that one of the one or more processors
sends an erroneous output to the decision logic module.

10. The system of claim 8, wherein the additional processor
is in operative communication with the decision logic module
when the additional processor is reconfigured, and a failed
processor is removed from communication with the decision 
logic module. 

11. The system of claim 7, wherein the additional processor
is reconfigured with recovery data from a processor that has 
not failed to maintain a level of redundancy. 

12. The system of claim 7, wherein the recovery mecha- 
nism in the one or more processors stores state data relevant to 
executing the first application. 

13. The system of claim 12, wherein the state data is used 
to reconfigure the additional processor. 

14. The system of claim 7, further comprising a second 
additional processor configured to execute a third application 
different from the first and second applications. 

15. The system of claim 14, wherein the second additional 
processor is reconfigurable such that it can execute the second 
application if the first additional processor fails. 

16. The system of claim 14, wherein the second additional 
processor is reconfigured with a consistent set of state data 
from a recovery mechanism of the first additional processor. 

17. A computing system that provides reconfigurable and 
recoverable computing resources, the system comprising: 

a real time computing platform; 

a scratchpad memory in the computing platform; 

a first processor with a recovery mechanism in the com- 
puting platform, the first processor configured to execute 
a first application, the recovery mechanism comprising: 
a duplicate memory; 

an even frame memory, wherein the recovery mecha- 
nism is configured to duplicate state variables com- 
puted by the real time computing platform during 
even computational frames into the even frame 
memory; and 
an odd frame memory, wherein the recovery mechanism
is configured to duplicate state variables computed by 
the real time computing platform during odd compu- 
tational frames into the odd frame memory; 

wherein the even frame memory and the odd frame
memory toggle back and forth duplicating state vari- 
ables into the duplicate memory for computational 
frames in which no fault is detected; 

one or more additional processors configured to execute 
one or more applications that are different from the first
application; 

wherein the one or more additional processors are recon- 
figurable such that they can execute the first application 
when needed for redundancy while the first processor is 
executing the first application, and wherein the one or
more additional processors can be further reconfigured 
to execute the one or more applications again that are 
different from the first application when redundancy is 
no longer required; and 

wherein upon determining a need to reconfigure, the recovery
mechanism restores a duplicate set of state variables
into the scratchpad memory. 

18. The system of claim 17, wherein state data is continuously
updated in the recovery mechanism of the first processor.

19. The system of claim 18, wherein the one or more 
additional processors are reconfigured with a consistent set of 
state data from the recovery mechanism of the first processor. 

20. The system of claim 17, wherein the first processor and 
the one or more additional processors are in operative communication
with a decision logic module after reconfiguration
of the one or more additional processors.



